431 Class 07

Thomas E. Love, Ph.D.

2023-09-19

Our Agenda

  • Setting up the dm1000 data
  • Identifying missing data and working around it
  • Assessing Normality
    • with histogram / boxplot / Normal Q-Q plot (good)
    • with numerical summaries, and with hypothesis tests (less good)
  • Summarizing Categorical Data

Our R Packages

library(broom)
library(Epi) ## for twoby2() function
library(gt) ## for making tables
library(gtExtras) ## for fancier tables
library(kableExtra) ## for kbl() function
library(janitor)
library(naniar) 
library(patchwork)
library(tidyverse) # always load tidyverse last

theme_set(theme_test()) # trying a new theme
knitr::opts_chunk$set(comment = NA)
  • I used #| message: false in the code chunk to silence messages about conflicts between R packages.

A New Data Set (dm_1000.csv)

Use Raw URL from 431-data page

url_temp <- "https://raw.githubusercontent.com/THOMASELOVE/431-data/main/data-and-code/dm_1000.csv"

dm1000 <- read_csv(url_temp, show_col_types = FALSE) |>
  clean_names() |>
  mutate(across(where(is.character), as_factor)) |>
  mutate(subject = as.character(subject))

1000 simulated adults (ages 31-75) living with diabetes, in Cuyahoga County, in one of four race/ethnicity categories and in one of four insurance categories.

  • One new variable as compared to dm431: residence (Cleveland or Suburbs)

dm1000 Code Book (1 of 3)

Variable Description
subject subject code (M-0001 through M-1000)
age subject’s age, in years
insurance primary insurance, 4 levels
n_income neighborhood median income, in $
ht height, in meters (2 decimal places)
wt weight, in kilograms (2 decimal places)

dm1000 Code Book (2 of 3)

Variable Description
sbp most recent systolic blood pressure (mm Hg)
dbp most recent diastolic blood pressure (mm Hg)
a1c most recent Hemoglobin A1c (%)
ldl most recent LDL cholesterol level (mg/dl)
tobacco most recent tobacco status, 3 levels
statin 1 = prescribed a statin in past 12m, 0 = not

dm1000 Code Book (3 of 3)

Variable Description
eye_exam 1 = diabetic eye exam in past 12m,
0 = no record of exam in past 12m
race_ethnicity race/ethnicity category, 3 levels
sex Female or Male
county all subjects live in Cuyahoga County
residence Cleveland or Suburbs

Dealing with Missing Data

Any Missing Data?

gg_miss_var(dm1000)

Counts of missingness, by variable

miss_var_summary(dm1000)
# A tibble: 17 × 3
   variable       n_miss pct_miss
   <chr>           <int>    <dbl>
 1 ldl               178     17.8
 2 n_income           28      2.8
 3 residence          28      2.8
 4 ht                 19      1.9
 5 tobacco            16      1.6
 6 a1c                15      1.5
 7 wt                  6      0.6
 8 sbp                 6      0.6
 9 dbp                 6      0.6
10 subject             0      0  
11 age                 0      0  
12 insurance           0      0  
13 statin              0      0  
14 eye_exam            0      0  
15 race_ethnicity      0      0  
16 sex                 0      0  
17 county              0      0  

Other useful functions

miss_var_table(dm1000)
# A tibble: 7 × 3
  n_miss_in_var n_vars pct_vars
          <int>  <int>    <dbl>
1             0      8    47.1 
2             6      3    17.6 
3            15      1     5.88
4            16      1     5.88
5            19      1     5.88
6            28      2    11.8 
7           178      1     5.88
miss_case_table(dm1000)
# A tibble: 7 × 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0     772      77.2
2              1     181      18.1
3              2      32       3.2
4              3       7       0.7
5              4       5       0.5
6              5       2       0.2
7              6       1       0.1

How should we summarize data with missing values?

It depends on what you’ll do with the data.

  • If you are providing a data summary, then you should summarize the complete cases, and specify the number of missing values.
  • If you are intending to use the sample you’ve collected to make an inference about a process or population or to build a model, then you may want to consider whether or not a complete-case analysis will introduce bias.
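
For the first bullet, a minimal sketch of a complete-case summary that also reports the missing count (using a toy vector here, since the idea doesn't depend on dm1000):

```r
# Sketch: summarize the complete cases and report how many values are missing.
# The same pattern applies to any numeric column with NAs.
x <- c(25, 73, 95, NA, 122, NA, 243)

summary_x <- c(mean = mean(x, na.rm = TRUE),
               median = median(x, na.rm = TRUE),
               n_complete = sum(!is.na(x)),
               n_missing = sum(is.na(x)))
summary_x
```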

What do graphs do with missing data?

ggplot(data = dm1000, aes(x = ldl)) +
  geom_histogram(bins = 20, fill = "slateblue", col = "cyan")
Warning: Removed 178 rows containing non-finite values (`stat_bin()`).

Exploring ldl in dm1000

p1 <- ggplot(dm1000, aes(sample = ldl)) +
  geom_qq(col = "slateblue") + 
  geom_qq_line(col = "magenta") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot: dm1000 LDL",
       y = "Actual LDL Cholesterol (mg/dl)",
       x = "Theoretical values from N(0,1)")

p2 <- ggplot(dm1000, aes(x = ldl)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 20, fill = "slateblue", col = "cyan") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(dm1000$ldl, na.rm = TRUE), 
                            sd = sd(dm1000$ldl, na.rm = TRUE)),
                col = "magenta", linewidth = 1.5) +
  labs(title = "Density Function: dm1000 LDL")

p3 <- ggplot(dm1000, aes(x = ldl, y = "")) +
  geom_boxplot(fill = "slateblue", outlier.color = "slateblue", notch = TRUE) + 
  labs(title = "Boxplot: dm1000 LDL", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

Exploring ldl in dm1000

Silenced Warnings for previous plot

Warning: Removed 178 rows containing non-finite values (stat_qq).
Warning: Removed 178 rows containing non-finite values (stat_qq_line).
Warning: Removed 178 rows containing non-finite values (stat_bin).
Warning: Removed 178 rows containing non-finite values (stat_boxplot).

So, all of those plots show only 822 observations, not 1000.

Summaries with Missing Data

dm1000 |> select(ldl) |> summary()
      ldl       
 Min.   : 25.0  
 1st Qu.: 73.0  
 Median : 95.0  
 Mean   :100.7  
 3rd Qu.:122.0  
 Max.   :243.0  
 NA's   :178    
mosaic::favstats(~ ldl, data = dm1000)
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2
 min Q1 median  Q3 max     mean       sd   n missing
  25 73     95 122 243 100.7202 37.05353 822     178
  • Silence the message about what happens when you load mosaic with #| message: false.

Comparing LDL by Insurance, Try 1

ggplot(data = dm1000, aes(x = ldl, y = insurance)) +
  geom_violin() +
  geom_boxplot(width = 0.3) +
  labs(title = "LDL Cholesterol by Insurance Type",
       caption = "dm1000 tibble")
Warning: Removed 178 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 178 rows containing non-finite values (`stat_boxplot()`).

Comparing LDL by Insurance, Try 2

dm_ldlcc <- dm1000 |> filter(complete.cases(ldl, insurance))

ggplot(data = dm_ldlcc, aes(x = ldl, y = insurance)) +
  geom_violin(aes(fill = insurance)) +
  geom_boxplot(width = 0.3) +
  scale_fill_viridis_d() +
  labs(title = "LDL Cholesterol by Insurance Type",
       caption = "dm1000 tibble with complete LDL and insurance data")

Comparing LDL by Insurance, Try 3

dm_ldlcc <- dm1000 |> filter(complete.cases(ldl, insurance))

ggplot(data = dm_ldlcc, aes(x = ldl, y = insurance)) +
  geom_violin(aes(fill = insurance)) +
  geom_boxplot(width = 0.3, notch = TRUE, 
               outlier.size = 3, outlier.color = "red") +
  scale_fill_viridis_d(begin = 0.3) +
  guides(fill = "none", col = "none") +
  labs(title = "LDL Cholesterol by Insurance Type",
       caption = "dm1000 tibble with complete LDL and insurance data",
       y = "", x = "LDL Cholesterol (mg/dl)")

Comparing LDL by Insurance, Try 3

LDL by Insurance comparison

mosaic::favstats(ldl ~ insurance, data = dm1000)
   insurance min    Q1 median    Q3 max      mean       sd   n missing
1   Medicaid  25 77.00  102.0 135.0 241 108.00372 41.31084 269      61
2 Commercial  34 79.75   96.5 116.5 208  98.94643 29.67796 168      28
3   Medicare  25 68.50   87.0 115.5 243  95.32958 35.91871 355      77
4  Uninsured  42 89.50  102.5 124.0 175 109.13333 36.57560  30      12

Assessing Normality in a batch of data

Does a Normal Model fit well?

If a Normal model fits our data well, then we should see the following graphical indications:

  1. A histogram that is symmetric and bell-shaped.
  2. A boxplot where the box is symmetric around the median, as are the whiskers, without serious outliers.
  3. A normal Q-Q plot that essentially falls on a straight line.

Diastolic BP in dm1000

p1 <- ggplot(dm1000, aes(sample = dbp)) +
  geom_qq(col = "purple") + geom_qq_line(col = "red") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot: dm1000 DBP",
       y = "Observed Diastolic BP (mm Hg)",
       x = "Theoretical from N(0,1)")

p2 <- ggplot(dm1000, aes(x = dbp)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 20, fill = "purple", col = "gold") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(dm1000$dbp, na.rm = TRUE), 
                            sd = sd(dm1000$dbp, na.rm = TRUE)),
                col = "red", linewidth = 1.5) +
  labs(title = "Density Function: dm1000 DBP")

p3 <- ggplot(dm1000, aes(x = dbp, y = "")) +
  geom_boxplot(fill = "purple", outlier.color = "purple", notch = TRUE) + 
  labs(title = "Boxplot: dm1000 DBP", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
  plot_annotation("Observed DBPs in dm1000 tibble")

Diastolic BP in dm1000

Stem-and-Leaf of dbp values?

  1. Do we see any implausible diastolic blood pressures here?
stem(dm1000$dbp, scale = 0.6, width = 75)

  The decimal point is 1 digit(s) to the right of the |

   4 | 13444557789999
   5 | 000001111111111112233333333344444444445555555556666666666777777+29
   6 | 000000000000000001111111111111111112222222222222222233333333333+151
   7 | 000000000000000000000011111111111111111111111111111111111122222+238
   8 | 000000000000000000000000000000000000000011111111111111111111111+163
   9 | 000000000000001111111122222222222333444444444455555556666666788
  10 | 00000111122222333445567788
  11 | 00145
  12 | 
  13 | 7

Extreme dbp values?

Which are the subjects with unusual values of dbp?

dm1000 |>
  filter(dbp < 50 | dbp > 110) |> 
  select(subject, sbp, dbp)
# A tibble: 18 × 3
   subject   sbp   dbp
   <chr>   <dbl> <dbl>
 1 M-0018    134    49
 2 M-0072    108    49
 3 M-0104    114    44
 4 M-0142    145   111
 5 M-0144     95    45
 6 M-0161    123    45
 7 M-0175    199   114
 8 M-0267     93    48
 9 M-0298     84    47
10 M-0458    131    49
11 M-0484    146    49
12 M-0523    123    44
13 M-0619    138    43
14 M-0745    152   115
15 M-0802    109    44
16 M-0838    130    41
17 M-0862    192   137
18 M-0949    135    47

Numerical Summaries for dbp?

Which summaries seem most useful for the dm1000 dbp data?

mosaic::favstats(~ dbp, data = dm1000)
 min Q1 median Q3 max     mean       sd   n missing
  41 66     75 82 137 74.46378 12.42027 994       6
Hmisc::describe(dm1000$dbp)
dm1000$dbp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     994        6       71    0.999    74.46    13.91       54       59 
     .25      .50      .75      .90      .95 
      66       75       82       89       95 

lowest :  41  43  44  45  47, highest: 110 111 114 115 137
  • Should we use median (25th percentile, 75th percentile) or should we use mean and standard deviation to summarize these data?

LDL in dm1000 approximately Normal?

How should we think about summarizing LDL?

mosaic::favstats(~ ldl, data = dm1000)
 min Q1 median  Q3 max     mean       sd   n missing
  25 73     95 122 243 100.7202 37.05353 822     178
  • The skew makes this a bit more challenging, as the mean and median are more substantially different from each other.
  • A transformation (perhaps with a natural logarithm, since all of these values are positive) might be helpful…

Comparing LDL to log(LDL)

p1 <- ggplot(dm1000, aes(x = ldl)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 20, fill = "tomato", col = "gold") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(dm1000$ldl, na.rm = TRUE), 
                            sd = sd(dm1000$ldl, na.rm = TRUE)),
                col = "blue", linewidth = 1.5) +
  labs(title = "Histogram: dm1000 LDL")

dm1000 <- dm1000 |> mutate(logldl = log(ldl))

p2 <- ggplot(dm1000, aes(x = logldl)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 20, fill = "deeppink", col = "gold") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(dm1000$logldl, na.rm = TRUE), 
                            sd = sd(dm1000$logldl, na.rm = TRUE)),
                col = "blue", linewidth = 1.5) +
  labs(title = "Histogram: Natural Log of dm1000 LDL")

p1 / p2

Comparing LDL to log(LDL)

Normal Q-Q of LDL and then log(LDL)

p1 <- ggplot(dm1000, aes(sample = ldl)) +
  geom_qq(col = "tomato") + geom_qq_line(col = "black") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot: dm1000 ldl",
       y = "LDL (mg/dl)", x = "")

p2 <- ggplot(dm1000, aes(sample = log(ldl))) +
  geom_qq(col = "deeppink") + geom_qq_line(col = "black") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot: dm1000 log(ldl)",
       y = "Natural log of LDL (mg/dl)", x = "")

p1 + p2

Normal Q-Q of LDL and then log(LDL)

dm1000: Neighborhood Income

dm1000: Natural Logarithm of Income

dm1000: Base-10 Logarithm of Income

Histograms for dm1000

Summarizing Distribution Shapes Numerically?

Quantifying skew with \(skew_1\)

Assessing skewness with \(skew_1\) (is the mean close to the median?)

\[ skew_1 = \frac{mean - median}{standard \ deviation} \]

Interpreting \(skew_1\) (for unimodal data)

  • \(skew_1 = 0\) if the mean and median are the same
  • \(skew_1 > 0.2\) indicates fairly substantial right skew
  • \(skew_1 < -0.2\) indicates fairly substantial left skew
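
One way to compute \(skew_1\) by hand, shown here on simulated right-skewed data since the formula needs only the mean, median, and standard deviation:

```r
# skew_1 = (mean - median) / sd, per the definition above
skew1 <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarizing
  (mean(x) - median(x)) / sd(x)
}

set.seed(431)
right_skewed <- rexp(n = 1000, rate = 1)  # simulated right-skewed data
skew1(right_skewed)                       # expect a value above 0.2
```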

\(skew_1\) for some dm1000 variables

Variable \(\bar{x}\) = mean median \(s\) = SD \(skew_1\)
sbp 132.77 132 17.95 0.043
dbp 74.46 75 12.42 -0.043
age 57.20 58 10.10 -0.079
ldl 100.72 95 37.05 0.154
n_income 35178 30587 18776 0.244
  • Don’t draw conclusions without a plot!
  • Does this tell us anything about outliers?

Empirical Rule for a Normal Model

If the data follow a Normal distribution, then about:

  • 68% of the data falls within 1 standard deviation of the mean
  • 95% of the data falls within mean \(\pm\) 2 standard deviations
  • 99.7% of the data falls within mean \(\pm\) 3 standard deviations

Remember that, for any data set, about half of the observations fall within the IQR (between Q1 and Q3).

SBPs within 1 SD of the mean?

dm1000 |>
  filter(complete.cases(sbp)) |>
  count(sbp > mean(sbp) - sd(sbp), 
        sbp < mean(sbp) + sd(sbp)) 
# A tibble: 3 × 3
  `sbp > mean(sbp) - sd(sbp)` `sbp < mean(sbp) + sd(sbp)`     n
  <lgl>                       <lgl>                       <int>
1 FALSE                       TRUE                          149
2 TRUE                        FALSE                         144
3 TRUE                        TRUE                          701
  • Note that 701/(149+144+701) = 701/994 = 0.705, approximately.
  • How does this compare to the expectation under a Normal model?

Coverage Probabilities in dm1000

Variable \(\bar{x}\) \(s\) = SD n \(\bar{x} \pm s\) \(\bar{x} \pm 2s\) \(\bar{x} \pm 3s\)
sbp 132.8 18.0 994 70.5% 95.0% 99.2%
dbp 74.5 12.4 994 67.5% 95.3% 99.7%
age 57.2 10.1 1000 64.8% 96.6% 100%
ldl 100.7 37.1 822 69.8% 95.6% 99.1%
n_income 35,178 18,776 972 75.0% 95.3% 98.5%
  • Conclusions about utility of the Normal model?
  • Do these match the conclusions from the plots?

Normal Q-Q plots for dm1000

Hypothesis Testing for Normality?

Don’t. Graphical approaches are far better than tests…

shapiro.test(dm1000$sbp)

    Shapiro-Wilk normality test

data:  dm1000$sbp
W = 0.98276, p-value = 1.88e-09
  • The very small p value indicates that the test finds some evidence against adopting a Normal model for these data.

  • Exciting, huh? Alas, not actually useful.

Other Hypothesis Tests of Normality

The nortest package, which I don’t even install as part of our 431 packages, includes many other possible tests of Normality for a batch of data, including:

  • nortest::ad.test() Anderson-Darling test
  • nortest::lillie.test() Lilliefors test
  • nortest::cvm.test() Cramer-von Mises test
  • nortest::sf.test() Shapiro-Francia test
  • nortest::pearson.test() Pearson chi-square test

Our dm1000 data: Testing Normality

Variable S-W A-D Lill C-vM S-Fr Pear
sbp 2e-09 3e-08 6e-10 2e-06 9e-09 <2e-16
dbp 0.0001 0.030 0.033 0.078 0.0001 0.0002
  • n_income has a tiny p value for all tests, but so what?
  • So does age, and so do sbp and ldl.

Simulated Normally distributed data

set.seed(431)
sim150 <- rnorm(n = 150, mean = 100, sd = 10)
sim1000 <- rnorm(n = 1000, mean = 100, sd = 10)
Variable \(skew_1\) \(\bar{x} \pm s\) \(\bar{x} \pm 2s\) \(\bar{x} \pm 3s\) Outliers?
sim150 0.113 67.3% 97.3% 99.3% 1/150
sim1000 0.024 67.7% 95.9% 99.9% 4/1000
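
The coverage figures above can be reproduced with a small helper function (a sketch; sim150 is re-simulated here so the snippet stands alone):

```r
# Proportion of observations within k standard deviations of the mean
coverage <- function(x, k) {
  mean(x > mean(x) - k * sd(x) & x < mean(x) + k * sd(x))
}

set.seed(431)
sim150 <- rnorm(n = 150, mean = 100, sd = 10)

coverage(sim150, 1)  # about 0.68 under a Normal model
coverage(sim150, 2)  # about 0.95
</imports>
```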

Normal Q-Q plots and Shapiro-Wilk?

  • Shapiro-Wilk: sim150 p = 0.682, while sim1000 p = 0.038

Why not test for Normality? (1)

There are multiple hypothesis testing schemes and each looks for one specific violation of a Normality assumption.

  • None can capture the wide range of issues our brains can envision, and none by itself is great at its job.
  • With any sort of reasonable sample size, the test is so poor at detecting non-normality compared to our eyes that it finds problems we don’t care about and ignores problems we do care about.

Why not test for Normality? (2)

  • And without a reasonable sample size, the test is essentially useless.

Whenever you can avoid hypothesis testing and instead actually plot the data, you should plot the data.

Sometimes you can’t plot (especially with really big data), but the test should be your very last resort.

Does a Normal Model fit well?

Do we have…

  1. A histogram that is symmetric and bell-shaped?
  2. A boxplot where the box is symmetric around the median, as are the whiskers, without severe outliers?
  3. A normal Q-Q plot that essentially falls on a straight line?
  4. If in doubt, maybe compare mean to median re: skew, and consider Empirical Rule to help make tough calls?

Big issue: why do you need to assume a Normal model?

Dealing with Categories

Using count to create a tibble of counts

dm1000 |> count(tobacco)
# A tibble: 4 × 2
  tobacco     n
  <fct>   <int>
1 Current   274
2 Never     343
3 Former    367
4 <NA>       16

Tabulating a categorical variable

dm1000 |> tabyl(tobacco) |> 
  adorn_pct_formatting() |>
  adorn_totals() 
 tobacco    n percent valid_percent
 Current  274   27.4%         27.8%
   Never  343   34.3%         34.9%
  Former  367   36.7%         37.3%
    <NA>   16    1.6%             -
   Total 1000       -             -
  • Does this order make sense?

Changing Order of tobacco levels

dm1000 <- dm1000 |> 
  mutate(tobacco = fct_relevel(tobacco, "Current", "Former"))

dm1000 |> tabyl(tobacco)
 tobacco   n percent valid_percent
 Current 274   0.274     0.2784553
  Former 367   0.367     0.3729675
   Never 343   0.343     0.3485772
    <NA>  16   0.016            NA
  • Does this order make more sense?
  • fct_relevel() is one of many useful tools in forcats.

Using the forcats package

In addition to fct_relevel() and fct_recode(), …

  • fct_reorder(): reordering a factor by another variable
  • fct_infreq(): reordering a factor by its frequency of values
  • fct_lump(): collapsing least frequent values into “other”
  • and several others
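
A tiny illustration of these tools on a toy factor (forcats loads with the tidyverse):

```r
library(forcats)

x <- factor(c("b", "b", "a", "c", "c", "c"))

levels(fct_infreq(x))         # reorder by frequency: "c", "b", "a"
levels(fct_relevel(x, "c"))   # move "c" first: "c", "a", "b"
levels(fct_lump_n(x, n = 1))  # keep the most frequent level, lump the rest into "Other"
```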

forcats references

  1. I use the forcats tools frequently in our Course Notes
  2. The forcats website at https://forcats.tidyverse.org/, especially the vignette
  3. Posit Cheat Sheet on Factors with forcats.
  4. R for Data Science on Factors

Using gt to make a table prettier

dm1000 |> 
  tabyl(tobacco) |> 
  adorn_pct_formatting() |> 
  adorn_totals() |> 
  gt() |> 
  tab_header(title = "Tobacco Status from dm1000")
Tobacco Status from dm1000
tobacco n percent valid_percent
Current 274 27.4% 27.8%
Former 367 36.7% 37.3%
Never 343 34.3% 34.9%
NA 16 1.6% -
Total 1000 - -

gtExtras lets us build 538-style tables

dm1000 |> 
  tabyl(tobacco) |> 
  adorn_pct_formatting() |> 
  adorn_totals() |> 
  gt() |> 
  gt_theme_538() |> 
  tab_header(title = "Table styled like 538")
Table styled like 538
tobacco n percent valid_percent
Current 274 27.4% 27.8%
Former 367 36.7% 37.3%
Never 343 34.3% 34.9%
NA 16 1.6% -
Total 1000 - -

gtExtras lets us build NYT-style tables

dm1000 |> tabyl(tobacco) |> adorn_pct_formatting() |> adorn_totals() |> 
  gt() |> 
  gt_theme_nytimes() |> 
  tab_header(title = "Table styled like the New York Times")
Table styled like the New York Times
tobacco n percent valid_percent
Current 274 27.4% 27.8%
Former 367 36.7% 37.3%
Never 343 34.3% 34.9%
NA 16 1.6% -
Total 1000 - -
  • There’s also a gt_theme_espn() and several others.

Using geom_bar to show a distribution

ggplot(dm1000, aes(x = tobacco)) +
  geom_bar()

Augmenting the geom_bar result

tempdat <- dm1000 |> filter(complete.cases(tobacco))

ggplot(data = tempdat, aes(x = tobacco, fill = tobacco)) +
  geom_bar() + 
  geom_text(aes(label = after_stat(count)), stat = "count", 
            vjust = 1.5, col = "white", size = 8) +
  scale_fill_viridis_d(option = "C", end = 0.8) +
  guides(fill = "none")

Augmenting the geom_bar result

Using count to create a tibble of counts

dm1000 |> 
  count(statin, tobacco)
# A tibble: 8 × 3
  statin tobacco     n
   <dbl> <fct>   <int>
1      0 Current    67
2      0 Former     80
3      0 Never      92
4      0 <NA>        3
5      1 Current   207
6      1 Former    287
7      1 Never     251
8      1 <NA>       13
dm1000 |> 
  count(insurance, residence)
# A tibble: 12 × 3
   insurance  residence     n
   <fct>      <fct>     <int>
 1 Medicaid   Suburbs     114
 2 Medicaid   Cleveland   201
 3 Medicaid   <NA>         15
 4 Commercial Suburbs      75
 5 Commercial Cleveland   119
 6 Commercial <NA>          2
 7 Medicare   Suburbs     163
 8 Medicare   Cleveland   259
 9 Medicare   <NA>         10
10 Uninsured  Suburbs      19
11 Uninsured  Cleveland    22
12 Uninsured  <NA>          1

Change the insurance ordering

dm1000 <- dm1000 |>
  mutate(insurance = fct_relevel(insurance, 
                                 "Medicare", "Commercial", "Medicaid"))

dm1000 |> tabyl(residence, insurance)
 residence Medicare Commercial Medicaid Uninsured
   Suburbs      163         75      114        19
 Cleveland      259        119      201        22
      <NA>       10          2       15         1

Cross-Tabulations

dm1000 |> 
  filter(complete.cases(insurance, residence)) |>
  tabyl(insurance, residence) |> 
  adorn_totals(where = c("row", "col")) |>
  gt()
insurance Suburbs Cleveland Total
Medicare 163 259 422
Commercial 75 119 194
Medicaid 114 201 315
Uninsured 19 22 41
Total 371 601 972

Were suburban residents more likely to have a statin prescription?

dm1000 |> 
  filter(complete.cases(statin, residence)) |>
  tabyl(residence, statin)
 residence   0   1
   Suburbs  77 294
 Cleveland 157 444

Revise statin order, add percentages

dm1000 |> filter(complete.cases(statin, residence)) |>
  mutate(statin = fct_relevel(factor(statin), "1", "0")) |>
  tabyl(residence, statin)
 residence   1   0
   Suburbs 294  77
 Cleveland 444 157
dm1000 |> filter(complete.cases(statin, residence)) |>
  mutate(statin = fct_relevel(factor(statin), "1", "0")) |>
  tabyl(residence, statin) |> 
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting()
 residence     1     0
   Suburbs 79.2% 20.8%
 Cleveland 73.9% 26.1%

Create using table instead

tempdat1 <- dm1000 |> 
  filter(complete.cases(statin, residence)) |>
  mutate(statin = fct_relevel(factor(statin), "1", "0")) 

tab1 <- table(tempdat1$residence, tempdat1$statin)

tab1
           
              1   0
  Suburbs   294  77
  Cleveland 444 157

Assess 2x2 table

twoby2(tab1)  # twoby2() is part of the Epi package
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : 1 
Comparing : Suburbs vs. Cleveland 

            1   0    P(1) 95% conf. interval
Suburbs   294  77  0.7925    0.7482   0.8307
Cleveland 444 157  0.7388    0.7022   0.7723

                                   95% conf. interval
             Relative Risk: 1.0727    0.9996   1.1510
         Sample Odds Ratio: 1.3501    0.9903   1.8407
Conditional MLE Odds Ratio: 1.3497    0.9805   1.8679
    Probability difference: 0.0537   -0.0018   0.1065

             Exact P-value: 0.0638 
        Asymptotic P-value: 0.0577 
------------------------------------------------------
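
The headline numbers in that output can be checked by hand; a sketch using the counts from tab1:

```r
# Counts from the 2x2 table above: statin = 1 vs 0, by residence
p_suburbs   <- 294 / (294 + 77)   # P(statin = 1 | Suburbs), about 0.7925
p_cleveland <- 444 / (444 + 157)  # P(statin = 1 | Cleveland), about 0.7388

p_suburbs / p_cleveland    # relative risk, about 1.0727
(294 / 77) / (444 / 157)   # sample odds ratio, about 1.3501
```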

A three-by-four two-way table

dm1000 |> filter(complete.cases(tobacco, insurance)) |>
  tabyl(tobacco, insurance) |>
  adorn_totals(where = c("row", "col"))
 tobacco Medicare Commercial Medicaid Uninsured Total
 Current       99         44      118        13   274
  Former      183         70      103        11   367
   Never      140         80      105        18   343
   Total      422        194      326        42   984
  • 3 rows, 4 columns: hence, this is a 3 x 4 table
  • It’s a two-way table, because we are studying the association of two variables (tobacco and insurance)
  • Compare insurance percentages by tobacco group?

Insurance rates by tobacco group

dm1000 |> filter(complete.cases(tobacco, insurance)) |>
  tabyl(tobacco, insurance) |> 
  adorn_percentages(denominator = "row") |>
  adorn_totals(where = "col") |> kbl(digits = 3)
tobacco Medicare Commercial Medicaid Uninsured Total
Current 0.361 0.161 0.431 0.047 1
Former 0.499 0.191 0.281 0.030 1
Never 0.408 0.233 0.306 0.052 1
  • These are proportions and not percentages.
  • Proportions fall between 0 and 1: multiply by 100 for percentages.
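
To show percentages directly, adorn_pct_formatting() can follow adorn_percentages() in the same pipeline, as in the earlier tabyl examples (a sketch on a toy tibble, assuming janitor is loaded as at the top of these slides):

```r
library(janitor)

toy <- data.frame(g = c("a", "a", "b", "b", "b"),
                  h = c("x", "y", "x", "x", "y"))

toy |>
  tabyl(g, h) |>
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting(digits = 1)  # row percentages such as "50.0%"
```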

Insurance rates by tobacco group?

tempdat2 <- dm1000 |> 
  filter(complete.cases(tobacco, insurance)) 

tab2 <- table(tempdat2$tobacco, tempdat2$insurance)

tab2
         
          Medicare Commercial Medicaid Uninsured
  Current       99         44      118        13
  Former       183         70      103        11
  Never        140         80      105        18
chisq.test(tab2)

    Pearson's Chi-squared test

data:  tab2
X-squared = 25.592, df = 6, p-value = 0.0002651

Grouped Numerical Summaries

LDL by Residence

mosaic::favstats(ldl ~ residence, data = dm1000)
  residence min Q1 median  Q3 max     mean       sd   n missing
1   Suburbs  32 73     98 123 239 101.3814 36.51435 312      59
2 Cleveland  25 74     93 122 243 100.4347 37.57225 490     111

LDL by Insurance

mosaic::favstats(ldl ~ insurance, data = dm1000)
   insurance min    Q1 median    Q3 max      mean       sd   n missing
1   Medicare  25 68.50   87.0 115.5 243  95.32958 35.91871 355      77
2 Commercial  34 79.75   96.5 116.5 208  98.94643 29.67796 168      28
3   Medicaid  25 77.00  102.0 135.0 241 108.00372 41.31084 269      61
4  Uninsured  42 89.50  102.5 124.0 175 109.13333 36.57560  30      12

LDL by Residence

ggplot(data = dm1000, aes(x = residence, y = ldl)) +
  geom_violin(aes(fill = residence)) +
  geom_boxplot(width = 0.3) +
  scale_fill_viridis_d(begin = 0.5, option = "D") +
  guides(fill = "none") 
Warning: Removed 178 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 178 rows containing non-finite values (`stat_boxplot()`).

LDL by Residence, improved

tempdat <- dm1000 |> filter(complete.cases(residence))

ggplot(data = tempdat, aes(x = residence, y = ldl)) +
  geom_violin(aes(fill = residence)) +
  geom_boxplot(width = 0.3) +
  scale_fill_viridis_d(begin = 0.5, option = "D") +
  guides(fill = "none") 

LDL by Insurance

Session Information

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2023-09-18
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 backports     1.4.1      2021-12-13 [1] CRAN (R 4.3.0)
 base64enc     0.1-3      2015-07-28 [1] CRAN (R 4.3.0)
 bit           4.0.5      2022-11-15 [1] CRAN (R 4.3.1)
 bit64         4.0.5      2020-08-30 [1] CRAN (R 4.3.1)
 broom       * 1.0.5      2023-06-09 [1] CRAN (R 4.3.1)
 checkmate     2.2.0      2023-04-27 [1] CRAN (R 4.3.1)
 cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.1)
 cluster       2.1.4      2022-08-22 [2] CRAN (R 4.3.1)
 cmprsk        2.2-11     2022-01-06 [1] CRAN (R 4.3.1)
 colorspace    2.1-0      2023-01-23 [1] CRAN (R 4.3.1)
 crayon        1.5.2      2022-09-29 [1] CRAN (R 4.3.1)
 curl          5.0.2      2023-08-14 [1] CRAN (R 4.3.1)
 data.table    1.14.8     2023-02-17 [1] CRAN (R 4.3.1)
 digest        0.6.33     2023-07-07 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.2      2023-04-20 [1] CRAN (R 4.3.1)
 Epi         * 2.47.1     2023-04-25 [1] CRAN (R 4.3.1)
 etm           1.1.1      2020-09-08 [1] CRAN (R 4.3.1)
 evaluate      0.21       2023-05-05 [1] CRAN (R 4.3.1)
 fansi         1.0.4      2023-01-22 [1] CRAN (R 4.3.1)
 farver        2.1.1      2022-07-06 [1] CRAN (R 4.3.1)
 fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.3.1)
 fontawesome   0.5.2      2023-08-19 [1] CRAN (R 4.3.1)
 forcats     * 1.0.0      2023-01-29 [1] CRAN (R 4.3.1)
 foreign       0.8-84     2022-12-06 [2] CRAN (R 4.3.1)
 Formula       1.2-5      2023-02-24 [1] CRAN (R 4.3.0)
 generics      0.1.3      2022-07-05 [1] CRAN (R 4.3.1)
 ggforce       0.4.1      2022-10-04 [1] CRAN (R 4.3.1)
 ggformula   * 0.10.4     2023-04-11 [1] CRAN (R 4.3.1)
 ggplot2     * 3.4.3      2023-08-14 [1] CRAN (R 4.3.1)
 ggridges      0.5.4      2022-09-26 [1] CRAN (R 4.3.1)
 ggstance      0.3.6      2022-11-16 [1] CRAN (R 4.3.1)
 glue          1.6.2      2022-02-24 [1] CRAN (R 4.3.1)
 gridExtra     2.3        2017-09-09 [1] CRAN (R 4.3.1)
 gt          * 0.9.0      2023-03-31 [1] CRAN (R 4.3.1)
 gtable        0.3.4      2023-08-21 [1] CRAN (R 4.3.1)
 gtExtras    * 0.4.5      2022-11-28 [1] CRAN (R 4.3.1)
 haven         2.5.3      2023-06-30 [1] CRAN (R 4.3.1)
 highr         0.10       2022-12-22 [1] CRAN (R 4.3.1)
 Hmisc         5.1-0      2023-05-08 [1] CRAN (R 4.3.1)
 hms           1.1.3      2023-03-21 [1] CRAN (R 4.3.1)
 htmlTable     2.4.1      2022-07-07 [1] CRAN (R 4.3.1)
 htmltools     0.5.6      2023-08-10 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.1)
 httr          1.4.7      2023-08-15 [1] CRAN (R 4.3.1)
 janitor     * 2.2.0      2023-02-02 [1] CRAN (R 4.3.1)
 jsonlite      1.8.7      2023-06-29 [1] CRAN (R 4.3.1)
 kableExtra  * 1.3.4      2021-02-20 [1] CRAN (R 4.3.1)
 knitr         1.43       2023-05-25 [1] CRAN (R 4.3.1)
 labeling      0.4.3      2023-08-29 [1] CRAN (R 4.3.1)
 labelled      2.12.0     2023-06-21 [1] CRAN (R 4.3.1)
 lattice     * 0.21-8     2023-04-05 [2] CRAN (R 4.3.1)
 lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.3.1)
 lubridate   * 1.9.2      2023-02-10 [1] CRAN (R 4.3.1)
 magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.1)
 MASS          7.3-60     2023-05-04 [2] CRAN (R 4.3.1)
 Matrix      * 1.6-1      2023-08-14 [1] CRAN (R 4.3.1)
 mgcv          1.8-42     2023-03-02 [2] CRAN (R 4.3.1)
 mosaic        1.8.4.2    2022-09-20 [1] CRAN (R 4.3.1)
 mosaicCore    0.9.2.1    2022-09-22 [1] CRAN (R 4.3.1)
 mosaicData  * 0.20.3     2022-09-01 [1] CRAN (R 4.3.1)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 4.3.1)
 naniar      * 1.0.0      2023-02-02 [1] CRAN (R 4.3.1)
 nlme          3.1-162    2023-01-31 [2] CRAN (R 4.3.1)
 nnet          7.3-19     2023-05-03 [2] CRAN (R 4.3.1)
 numDeriv      2016.8-1.1 2019-06-06 [1] CRAN (R 4.3.0)
 paletteer     1.5.0      2022-10-19 [1] CRAN (R 4.3.1)
 patchwork   * 1.1.3      2023-08-14 [1] CRAN (R 4.3.1)
 pillar        1.9.0      2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.3.1)
 plyr          1.8.8      2022-11-11 [1] CRAN (R 4.3.1)
 polyclip      1.10-4     2022-10-20 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2      2023-08-10 [1] CRAN (R 4.3.1)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.1)
 Rcpp          1.0.11     2023-07-06 [1] CRAN (R 4.3.1)
 readr       * 2.1.4      2023-02-10 [1] CRAN (R 4.3.1)
 rematch2      2.1.2      2020-05-01 [1] CRAN (R 4.3.1)
 rlang         1.1.1      2023-04-28 [1] CRAN (R 4.3.1)
 rmarkdown     2.24       2023-08-14 [1] CRAN (R 4.3.1)
 rpart         4.1.19     2022-10-21 [2] CRAN (R 4.3.1)
 rstudioapi    0.15.0     2023-07-07 [1] CRAN (R 4.3.1)
 rvest         1.0.3      2022-08-19 [1] CRAN (R 4.3.1)
 sass          0.4.7      2023-07-15 [1] CRAN (R 4.3.1)
 scales        1.2.1      2022-08-20 [1] CRAN (R 4.3.1)
 sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.1)
 snakecase     0.11.1     2023-08-27 [1] CRAN (R 4.3.1)
 stringi       1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0      2022-12-02 [1] CRAN (R 4.3.1)
 survival      3.5-5      2023-03-12 [2] CRAN (R 4.3.1)
 svglite       2.1.1      2023-01-10 [1] CRAN (R 4.3.1)
 systemfonts   1.0.4      2022-02-11 [1] CRAN (R 4.3.1)
 tibble      * 3.2.1      2023-03-20 [1] CRAN (R 4.3.1)
 tidyr       * 1.3.0      2023-01-24 [1] CRAN (R 4.3.1)
 tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.3.1)
 tidyverse   * 2.0.0      2023-02-22 [1] CRAN (R 4.3.1)
 timechange    0.2.0      2023-01-11 [1] CRAN (R 4.3.1)
 tweenr        2.0.2      2022-09-06 [1] CRAN (R 4.3.1)
 tzdb          0.4.0      2023-05-12 [1] CRAN (R 4.3.1)
 utf8          1.2.3      2023-01-31 [1] CRAN (R 4.3.1)
 vctrs         0.6.3      2023-06-14 [1] CRAN (R 4.3.1)
 viridisLite   0.4.2      2023-05-02 [1] CRAN (R 4.3.1)
 visdat        0.6.0      2023-02-02 [1] CRAN (R 4.3.1)
 vroom         1.6.3      2023-04-28 [1] CRAN (R 4.3.1)
 webshot       0.5.5      2023-06-26 [1] CRAN (R 4.3.1)
 withr         2.5.0      2022-03-03 [1] CRAN (R 4.3.1)
 xfun          0.40       2023-08-09 [1] CRAN (R 4.3.1)
 xml2          1.3.5      2023-07-06 [1] CRAN (R 4.3.1)
 yaml          2.3.7      2023-01-23 [1] CRAN (R 4.3.0)
 zoo           1.8-12     2023-04-13 [1] CRAN (R 4.3.1)

 [1] C:/Users/thoma/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────